INTERSPEECH.2018 - Speech Processing

Total: 43

#1 Permutation Invariant Training of Generative Adversarial Network for Monaural Speech Separation

Authors: Lianwu Chen ; Meng Yu ; Yanmin Qian ; Dan Su ; Dong Yu

We explore generative adversarial networks (GANs) for speech separation, particularly with permutation invariant training (SSGAN-PIT). Prior work demonstrates that GANs can be used to suppress additive noise in noisy speech waveforms and improve perceptual speech quality. In this work, we train GANs for speech separation that enhance multiple speech sources simultaneously, with the permutation issue addressed by utterance-level PIT in the training of the generator network. We propose operating GANs in the power spectrum domain instead of on waveforms to reduce computation. To better model time dependencies, recurrent neural networks (RNNs) with long short-term memory (LSTM) are adopted for both the generator and the discriminator in this study. We evaluated SSGAN-PIT on the WSJ0 two-talker mixed speech separation task and found that SSGAN-PIT outperforms both SSGAN without PIT and neural-network-based speech separation with or without PIT. The evaluation confirms the feasibility of the proposed model and training approach for efficient speech separation. The convergence behaviors of permutation invariant training and adversarial training are also analyzed.
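
The core of utterance-level PIT is to compute the separation loss under every possible assignment of network outputs to reference sources and keep only the best one. Below is a minimal NumPy sketch of that loss for two sources; the MSE criterion, the spectral input shapes and the toy example are illustrative assumptions, not the SSGAN-PIT configuration.

```python
# Minimal sketch of an utterance-level permutation invariant training (PIT) loss.
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """estimates, references: arrays of shape (n_sources, n_frames, n_bins)."""
    n_sources = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_sources)):
        # Mean squared error for this assignment of outputs to references.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((2, 100, 257))
    ests = refs[::-1] + 0.1 * rng.standard_normal((2, 100, 257))  # outputs in swapped order
    loss, perm = pit_mse_loss(ests, refs)
    print(loss, perm)  # the permutation (1, 0) should be selected
```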

#2 Deep Extractor Network for Target Speaker Recovery from Single Channel Speech Mixtures

Authors: Jun Wang ; Jie Chen ; Dan Su ; Lianwu Chen ; Meng Yu ; Yanmin Qian ; Dong Yu

Speaker-aware source separation methods are promising workarounds for major difficulties such as arbitrary source permutation and an unknown number of sources. However, it remains challenging to achieve satisfying performance given only a very short target speaker utterance (anchor). Here we present a novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high-dimensional embedding space and pulls together the time-frequency bins corresponding to the target speaker. The proposed model differs from prior work in that the canonical embedding space encodes knowledge of both the anchor and the mixture during the training phase: embeddings for the anchor and the mixture speech are first constructed separately in a primary embedding space and then combined as input to feed-forward layers that transform them into a canonical embedding space, which we find to be more stable than the primary one. Experimental results show that, given a very short utterance, the proposed model can efficiently recover high-quality target speech from a mixture and outperforms various baseline models, with 5.2% and 6.6% relative improvements in SDR and PESQ respectively compared with a baseline oracle deep attractor model. We also show that it generalizes well to more than one interfering speaker.

#3 Joint Localization and Classification of Multiple Sound Sources Using a Multi-task Neural Network

Authors: Weipeng He ; Petr Motlicek ; Jean-Marc Odobez

We propose a novel multi-task neural network-based approach for joint sound source localization and speech/non-speech classification in noisy environments. The network takes the raw short-time Fourier transform as input and outputs likelihood values for the two tasks, which are used for the simultaneous detection, localization and classification of an unknown number of overlapping sound sources. Tested with real recorded data, our method achieves significantly better performance in terms of speech/non-speech classification and localization of speech sources, compared to a method that performs localization and classification separately. In addition, we demonstrate that incorporating temporal context can further improve performance.

#4 Detection of Glottal Closure Instants from Speech Signals: A Convolutional Neural Network Based Method

Authors: Shuai Yang ; Zhiyong Wu ; Binbin Shen ; Helen Meng

Most conventional methods for detecting glottal closure instants (GCIs) are based on signal processing techniques and different GCI candidate selection schemes. This paper proposes a classification method for detecting glottal closure instants from speech waveforms using a convolutional neural network (CNN). The procedure is divided into two successive steps. First, a low-pass filtered signal is computed, whose negative peaks are taken as candidates for GCI placement. Second, a CNN-based classification model determines for each peak whether it corresponds to a GCI or not. The method is compared with three existing GCI detection algorithms on two publicly available databases. For the proposed method, the detection accuracy in terms of F1-score is 98.23%. An additional experiment indicates that the model performs better when trained on speech data from the same speakers as those in the test set.
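
The candidate-generation step (low-pass filtering followed by negative-peak picking) can be sketched with standard signal-processing tools. A minimal SciPy version follows; the cutoff frequency, filter order and peak-spacing constraint are illustrative assumptions, and the CNN classification stage is omitted.

```python
# Minimal sketch of the GCI candidate-generation step: low-pass filter the speech
# and take negative peaks of the filtered signal as candidates.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def gci_candidates(speech, fs, cutoff_hz=800.0, max_f0_hz=500.0):
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    low_passed = filtfilt(b, a, speech)
    # Negative peaks of the filtered signal = peaks of its negation.
    min_distance = int(fs / max_f0_hz)  # minimum spacing: one period at the highest expected F0
    peaks, _ = find_peaks(-low_passed, distance=min_distance)
    return peaks  # sample indices of GCI candidates

# Example: candidates = gci_candidates(waveform, fs=16000)
```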

#5 Robust TDOA Estimation Based on Time-Frequency Masking and Deep Neural Networks

Authors: Zhong-Qiu Wang ; Xueliang Zhang ; DeLiang Wang

Deep learning based time-frequency (T-F) masking has dramatically advanced monaural speech separation and enhancement. This study investigates its potential for robust time difference of arrival (TDOA) estimation in noisy and reverberant environments. Three novel algorithms are proposed to improve the robustness of conventional cross-correlation-, beamforming- and subspace-based algorithms for speaker localization. The key idea is to leverage the power of deep neural networks (DNN) to accurately identify T-F units that are relatively clean for TDOA estimation. All of the proposed algorithms exhibit strong robustness for TDOA estimation in environments with low input SNR, high reverberation and low direction-to-reverberant energy ratio.
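
One common way to exploit such a mask is to pool a PHAT-weighted cross-correlation over only the T-F units judged clean, in the spirit of the cross-correlation-based variant described above. A minimal NumPy sketch follows; the pooling scheme and the assumption that a DNN already supplies `mask` in [0, 1] are illustrative, not the paper's exact algorithms.

```python
# Minimal sketch of mask-weighted GCC-PHAT for two-microphone TDOA estimation.
import numpy as np

def masked_gcc_phat_tdoa(X1, X2, mask, fs, n_fft):
    """X1, X2: STFTs of the two microphones, shape (n_bins, n_frames), n_bins = n_fft // 2 + 1."""
    cross = X1 * np.conj(X2)
    phat = cross / (np.abs(cross) + 1e-8)      # PHAT weighting
    weighted = np.sum(mask * phat, axis=1)     # pool frames, trusting the clean T-F units
    cc = np.fft.fftshift(np.fft.irfft(weighted, n=n_fft))
    lag = np.argmax(cc) - n_fft // 2           # lag in samples
    return lag / fs                            # TDOA in seconds
```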

#6 Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks

Authors: Akihiro Kato ; Tomi Kinnunen

The fundamental frequency (F0) represents pitch in speech, determines its prosodic characteristics and is needed in various speech analysis and synthesis tasks. Despite decades of research on this topic, F0 estimation at low signal-to-noise ratios (SNRs) in unexpected noise conditions remains difficult. This work proposes a new approach to noise-robust F0 estimation using a recurrent neural network (RNN) trained in a supervised manner. Recent studies employ deep neural networks (DNNs) for F0 tracking as a frame-by-frame classification task over quantised frequency states, but we propose waveform-to-sinusoid regression instead to achieve both noise robustness and accurate estimation with increased frequency resolution. Experimental results with the PTDB-TUG corpus contaminated by additive noise (NOISEX-92) demonstrate that the proposed method improves the gross pitch error (GPE) rate and fine pitch error (FPE) by more than 35% at SNRs between -10 dB and +10 dB compared with the well-known noise-robust F0 tracker PEFAC. Furthermore, the proposed method also outperforms state-of-the-art DNN-based approaches by more than 15% in terms of both FPE and GPE rate over the same SNR range.
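
For reference, the two reported measures can be computed as follows under their common definitions; the 20% gross-error threshold and the standard-deviation form of FPE are conventional assumptions, not details taken from the paper.

```python
# Minimal sketch of gross pitch error (GPE) rate and fine pitch error (FPE):
# GPE rate = fraction of voiced frames whose F0 estimate deviates by more than
# 20% from the reference; FPE = standard deviation of the relative error on the
# remaining voiced frames.
import numpy as np

def gpe_fpe(f0_est, f0_ref, voiced, threshold=0.2):
    """f0_est, f0_ref: per-frame F0 arrays; voiced: boolean mask of voiced frames."""
    rel_err = np.abs(f0_est[voiced] - f0_ref[voiced]) / f0_ref[voiced]
    gross = rel_err > threshold
    gpe_rate = np.mean(gross)
    fpe = np.std(rel_err[~gross]) if np.any(~gross) else np.nan
    return gpe_rate, fpe
```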

#7 Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation

Authors: Paul Magron ; Konstantinos Drossos ; Stylianos Ioannis Mimilakis ; Tuomas Virtanen

State-of-the-art methods for monaural singing voice separation estimate the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is further synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithms that tackle this issue and can further enhance the separation quality. These algorithms exploit phase constraints that originate from a sinusoidal model or from consistency, a property that is a direct consequence of the STFT redundancy. Experiments conducted on real music songs show that these algorithms are efficient at reducing interference in the estimated voice compared to the baseline approach.
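
To illustrate the consistency constraint, here is a minimal sketch of iterative phase recovery in the spirit of Griffin-Lim: the estimated voice magnitude is kept fixed while the STFT is repeatedly resynthesized and re-analyzed so that it approaches a consistent spectrogram. This illustrates the general idea only, not the specific sinusoidal-model or consistency-based algorithms evaluated in the paper.

```python
# Minimal sketch of consistency-based phase recovery (Griffin-Lim style).
import numpy as np
from scipy.signal import stft, istft

def consistent_phase_recovery(magnitude, mixture_phase, fs, nperseg=1024, n_iter=50):
    """magnitude, mixture_phase: (n_bins, n_frames) from an STFT with the same parameters."""
    X = magnitude * np.exp(1j * mixture_phase)      # initialize with the mixture phase
    for _ in range(n_iter):
        _, x = istft(X, fs=fs, nperseg=nperseg)     # back to the time domain
        _, _, X = stft(x, fs=fs, nperseg=nperseg)   # re-analysis = projection onto consistent STFTs
        X = magnitude * np.exp(1j * np.angle(X))    # restore the estimated magnitude
    _, x = istft(X, fs=fs, nperseg=nperseg)
    return x
```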

#8 Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors

Author: Kanru Hua

An F0 and voicing status estimation algorithm for high-quality speech analysis/synthesis is proposed. The problem is approached from a different perspective: modeling the behavior of feature extractors under noise instead of directly modeling speech signals. Under time-frequency locality assumptions, the joint distribution of extracted features and target F0 can be characterized by training a bank of Gaussian mixture models (GMMs) on artificial data generated from Monte-Carlo simulations. The trained GMMs can then be used to generate a set of conditional distributions on the predicted F0, which are combined and post-processed by the Viterbi algorithm to give a final F0 trajectory. Evaluation on the CSTR and CMU Arctic speech databases shows that the proposed method, trained on fully synthetic data, achieves lower gross error rates than state-of-the-art methods.
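
The final smoothing stage described above amounts to a Viterbi search over per-frame F0 candidate scores with a transition cost that discourages large pitch jumps. A minimal NumPy sketch is given below; the log-frequency jump penalty and the candidate grid are illustrative assumptions, not the trained GMM front-end.

```python
# Minimal sketch of Viterbi post-processing over per-frame F0 candidate scores.
import numpy as np

def viterbi_f0(log_likelihood, candidates_hz, jump_penalty=0.5):
    """log_likelihood: (n_frames, n_candidates); candidates_hz: (n_candidates,) positive."""
    n_frames, n_states = log_likelihood.shape
    # Transition cost proportional to the jump in log-frequency between candidates.
    log_f = np.log(candidates_hz)
    trans = -jump_penalty * np.abs(log_f[:, None] - log_f[None, :])
    delta = log_likelihood[0].copy()
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + trans                     # (from_state, to_state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(n_states)] + log_likelihood[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(n_frames - 1, 0, -1):                    # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return candidates_hz[path]
```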

#9 Real-time Single-channel Dereverberation and Separation with Time-domain Audio Separation Network

Authors: Yi Luo ; Nima Mesgarani

We investigate the recently proposed Time-domain Audio Separation Network (TasNet) for real-time single-channel speech dereverberation. Unlike systems that take a time-frequency representation of the audio as input, TasNet learns an adaptive front-end that replaces the time-frequency representation with a time-domain convolutional non-negative autoencoder. We show that by formulating the dereverberation problem as a denoising problem, where the direct path is separated from the reverberations, a TasNet denoising autoencoder can outperform a deep LSTM baseline operating on log-power magnitude spectrogram input in both causal and non-causal settings. We further show that adjusting the stride size in the convolutional autoencoder helps both dereverberation and separation performance.

#10 Music Source Activity Detection and Separation Using Deep Attractor Network

Authors: Rajath Kumar ; Yi Luo ; Nima Mesgarani

In music signal processing, singing voice detection and music source separation are widely researched topics. Recent progress in deep neural network based source separation has advanced the state of the art in vocal and instrument separation, while the problem of joint source activity detection and separation remains unexplored. In this paper, we propose an approach that performs source activity detection using the high-dimensional embedding generated by a Deep Attractor Network (DANet) trained for music source separation. By defining both tasks together, DANet is able to dynamically estimate the number of outputs depending on the active sources. We also propose an Expectation-Maximization (EM) training paradigm for DANet which further improves the separation performance of the original DANet. Experiments show that our network achieves better source separation and comparable source activity detection performance compared with a baseline system.

#11 Improving Mandarin Tone Recognition Using Convolutional Bidirectional Long Short-Term Memory with Attention

Authors: Longfei Yang ; Yanlu Xie ; Jinsong Zhang

Automatic tone recognition is useful for Mandarin spoken language processing. However, the complex F0 variations caused by tone co-articulation and the interplay among tones make it rather difficult to perform tone recognition on Chinese continuous speech. This paper explores the application of Bidirectional Long Short-Term Memory (BLSTM), which can model time series, to Mandarin tone recognition in order to handle tone variations in continuous speech. In addition, we introduce an attention mechanism to guide the model to select suitable context information. The experimental results show that the proposed CNN-BLSTM with attention performs best, achieving a tone error rate (TER) of 9.30%, a 17.6% relative error reduction from the DNN baseline system with a TER of 11.28%. This demonstrates that the proposed model handles the complex F0 variations more effectively than the other models.

#12 Improving Sparse Representations in Exemplar-Based Voice Conversion with a Phoneme-Selective Objective Function

Authors: Shaojin Ding ; Guanlong Zhao ; Christopher Liberatore ; Ricardo Gutierrez-Osuna

The acoustic quality of exemplar-based voice conversion (VC) degrades whenever the phoneme labels of the selected exemplars do not match the phonetic content of the frame being represented. To address this issue, we propose a Phoneme-Selective Objective Function (PSOF) that promotes a sparse representation of each speech frame with exemplars from a few phoneme classes. Namely, PSOF enforces group sparsity on the representation, where each group corresponds to a phoneme class. Under the proposed objective function, the exemplars within a phoneme class tend to be activated or suppressed together. We conducted two sets of experiments on the ARCTIC corpus to evaluate the proposed method. First, we evaluated the ability of PSOF to reduce phoneme mismatches. Then, we assessed its performance on a VC task and compared it against three baseline methods from previous studies. Results from objective measurements and subjective listening tests show that the proposed method effectively reduces phoneme mismatches and significantly improves VC acoustic quality while retaining the voice identity of the target speaker.
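
Group sparsity of this kind is typically obtained by penalizing the l2 norm of each group of activations. The sketch below solves a generic group-lasso coding problem with proximal gradient descent, where each group of dictionary columns stands for one phoneme class; it only illustrates the group-sparsity mechanism and is not the PSOF objective or the optimizer used in the paper.

```python
# Minimal sketch of group-sparse coding via proximal gradient descent (ISTA).
import numpy as np

def group_sparse_code(frame, dictionary, groups, lam=0.1, n_iter=200):
    """dictionary: (dim, n_exemplars); groups: list of index arrays, one per phoneme class."""
    a = np.zeros(dictionary.shape[1])
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2          # 1 / Lipschitz constant
    for _ in range(n_iter):
        a = a - step * dictionary.T @ (dictionary @ a - frame)   # gradient step on the data term
        for g in groups:                                         # group soft-thresholding
            norm_g = np.linalg.norm(a[g])
            shrink = max(0.0, 1.0 - lam * step / norm_g) if norm_g > 0 else 0.0
            a[g] *= shrink
    return a
```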

#13 Learning Structured Dictionaries for Exemplar-based Voice Conversion

Authors: Shaojin Ding ; Christopher Liberatore ; Ricardo Gutierrez-Osuna

Incorporating phonetic information has been shown to improve the performance of exemplar-based voice conversion. A standard approach is to build a phonetically structured dictionary, where exemplars are categorized into sub-dictionaries according to their phoneme labels. However, acquiring phoneme labels can be expensive, and the labels can be inaccurate; the latter problem becomes more salient when the speakers are non-native. This paper presents an iterative dictionary-learning algorithm that avoids the need for phoneme labels and instead learns the structured dictionaries in an unsupervised fashion. At each iteration, two steps are performed alternately: a cluster update and a dictionary update. In the cluster update step, each training frame is assigned to the cluster whose sub-dictionary represents it with the lowest residual. In the dictionary update step, the sub-dictionary for a cluster is updated using all the speech frames in the cluster. We evaluate the proposed algorithm through objective and subjective experiments on a new corpus of non-native English speech. Compared to previous studies, the proposed algorithm improves the acoustic quality of voice-converted speech while retaining the target speaker's identity.

#14 Exemplar-Based Spectral Detail Compensation for Voice Conversion

Authors: Yu-Huai Peng ; Hsin-Te Hwang ; Yichiao Wu ; Yu Tsao ; Hsin-Min Wang

Most voice conversion (VC) systems are established under the vocoder-based VC framework. When performing spectral conversion (SC) under this framework, low-dimensional spectral features, such as mel-cepstral coefficients (MCCs), are often adopted to represent the high-dimensional spectral envelopes. The joint density Gaussian mixture model (GMM)-based SC method with the STRAIGHT vocoder is a well-known representative. Although it is reasonably effective, the loss of spectral details in the converted spectral envelopes inevitably deteriorates speech quality and similarity. To overcome this problem, we propose a novel exemplar-based spectral detail compensation method for VC. In the offline stage, paired dictionaries of source spectral envelopes and target spectral details are constructed. In the online stage, the locally linear embedding (LLE) algorithm is applied to predict the target spectral details from the source spectral envelopes, and the predicted spectral details are then used to compensate the converted spectral envelopes obtained by a baseline GMM-based SC method with the STRAIGHT vocoder. Experimental results show that the proposed method notably improves the baseline system in both objective and subjective tests.
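
The LLE-style prediction step can be sketched as follows: reconstruct the source envelope from its K nearest source exemplars with weights that sum to one, then apply the same weights to the paired target spectral details. The neighborhood size and regularizer are illustrative assumptions; the paired dictionaries come from the offline stage described above.

```python
# Minimal sketch of LLE-based prediction of target spectral details.
import numpy as np

def lle_predict(src_frame, src_dict, tgt_dict, k=8, reg=1e-3):
    """src_dict, tgt_dict: paired dictionaries of shape (n_exemplars, dim)."""
    dists = np.linalg.norm(src_dict - src_frame, axis=1)
    nn = np.argsort(dists)[:k]                       # K nearest source exemplars
    Z = src_dict[nn] - src_frame                     # centered neighbors, shape (k, dim)
    G = Z @ Z.T                                      # local Gram matrix
    G += reg * np.trace(G) * np.eye(k)               # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                     # reconstruction weights sum to one
    return w @ tgt_dict[nn]                          # predicted target spectral detail
```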

#15 Whispered Speech to Neutral Speech Conversion Using Bidirectional LSTMs

Authors: G. Nisha Meenakshi ; Prasanta Kumar Ghosh

We propose a bidirectional long short-term memory (BLSTM) based whispered speech to neutral speech conversion system that employs the STRAIGHT speech synthesizer. We use a BLSTM to map the spectral features of whispered speech to those of neutral speech. Three other BLSTMs are employed to predict the pitch, periodicity levels and the voiced/unvoiced phoneme decisions from the spectral features of whispered speech. We use objective measures to quantify the quality of the predicted spectral features and excitation parameters, using data recorded from six subjects in a four-fold setup. We find that the temporal smoothness of the spectral features predicted by the proposed BLSTM-based system is statistically higher than that of features predicted by deep neural network based baseline schemes. We also observe that while the performance of the proposed system is comparable to the baseline scheme for pitch prediction, it is superior in classifying voicing decisions and predicting periodicity levels. From subjective evaluation via listening tests, we find that the proposed method is chosen as the best-performing scheme 26.61% (absolute) more often than the best baseline scheme. This reveals that the proposed method yields more natural-sounding neutral speech from whispered speech.

#16 Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance

Authors: Songxiang Liu ; Jinghua Zhong ; Lifa Sun ; Xixin Wu ; Xunying Liu ; Helen Meng

Developing a voice conversion (VC) system for a particular speaker typically requires considerable data from both the source and target speakers. This paper aims to effectuate VC across arbitrary speakers, which we call any-to-any VC, with only a single target-speaker utterance. Two systems are studied: (1) the i-vector-based VC (IVC) system and (2) the speaker-encoder-based VC (SEVC) system. Phonetic PosteriorGrams are adopted as speaker-independent linguistic features extracted from speech samples. Both systems train a multi-speaker deep bidirectional long short-term memory (DBLSTM) VC model, taking in additional inputs that encode speaker identities in order to generate the outputs. In the IVC system, the speaker identity of a new target speaker is represented by i-vectors. In the SEVC system, the speaker identity is represented by a speaker embedding predicted from a separately trained model. Experiments verify the effectiveness of both systems in achieving VC based on only a single target-speaker utterance. Furthermore, the IVC approach is superior to SEVC in terms of the quality of the converted speech and its similarity to the utterance produced by the genuine target speaker.

#17 Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

Authors: Ju-chieh Chou ; Cheng-chieh Yeh ; Hung-yi Lee ; Lin-shan Lee

Recently, the cycle-consistent adversarial network (Cycle-GAN) has been successfully applied to voice conversion to a different speaker without parallel data, although in those approaches an individual model is needed for each target speaker. In this paper, we propose an adversarial learning framework for voice conversion with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals. An autoencoder is first trained to extract a speaker-independent latent representation and a speaker embedding separately, using an auxiliary speaker classifier to regularize the latent representation. The decoder then takes the speaker-independent latent representation and the target speaker embedding as input to generate the voice of the target speaker with the linguistic content of the source utterance. The quality of the decoder output is further improved by patching it with the residual signal produced by another generator-discriminator pair. A target speaker set of size 20 was tested in preliminary experiments and very good voice quality was obtained. Conventional voice conversion metrics are reported. We also show that the speaker information has been properly reduced in the latent representations.

#18 Joint Noise and Reverberation Adaptive Learning for Robust Speaker DOA Estimation with an Acoustic Vector Sensor

Authors: Disong Wang ; Yuexian Zou

Deep neural network (DNN) based DOA estimation (DNN-DOAest) methods report superior performance, but degradation is observed under strong additive noise and room reverberation. Motivated by our previous work with an acoustic vector sensor (AVS) and the great success of DNN-based speech denoising and dereverberation (DNN-SDD), a unified DNN framework for robust DOA estimation is thoroughly investigated in this paper. First, a novel DOA cue termed the sub-band inter-sensor data ratio (Sb-ISDR) is proposed to efficiently represent DOA information for training a DNN-DOAest model. Second, a speech-aware DNN-SDD is presented, where coherence vectors denoting the probability of time-frequency points being dominated by speech are used as an additional input to facilitate training to predict complex ideal ratio masks. Last, by stacking the DNN-DOAest on the DNN-SDD with a joint part, the unified network is jointly fine-tuned, which enables the DNN-SDD to serve as a pre-processing front-end that adaptively generates 'clean' speech features that are easier for the following DNN-DOAest to classify correctly for robust DOA estimation. Experimental results on simulated and recorded data confirm the effectiveness and superiority of the proposed methods under different noise and reverberation conditions compared with baseline methods.

#19 Multiple Concurrent Sound Source Tracking Based on Observation-Guided Adaptive Particle Filter

Authors: Hong Liu ; Haipeng Lan ; Bing Yang ; Cheng Pang

The particle filter (PF) has proved to be an effective tool for tracking sound sources. In a traditional PF, a pre-defined dynamic model is used to model source motion, which tends to be mismatched due to the uncertainty of source motion. Besides, non-stationary interferences pose a severe challenge to source tracking. To this end, an observation-guided adaptive particle filter (OAPF) is proposed for multiple concurrent sound source tracking. First, sensor signals are processed in the time-frequency domain to obtain direction of arrival (DOA) observations of the sources. Then, by updating particle states with these DOA observations, angular distances between particles and observations are reduced, guiding particles toward the directions of the sources. Third, particle weights are updated by an interference-adaptive likelihood function to reduce the impact of interferences. Finally, with the updated particles and the corresponding weights, OAPF determines the final DOAs of the sources. Experimental results demonstrate that our method achieves favorable performance for multiple concurrent sound source tracking in noisy environments.
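
For context, a basic particle filter step for a single DOA track is sketched below: random-walk prediction, weighting against an observed DOA and systematic resampling. The Gaussian likelihood, parameter values and single-source simplification are illustrative assumptions; the observation-guided particle update and interference-adaptive likelihood of OAPF are not reproduced here.

```python
# Minimal sketch of one particle filter step for tracking a single DOA (degrees).
import numpy as np

def wrap_deg(angle):
    return (angle + 180.0) % 360.0 - 180.0

def pf_step(particles, weights, observed_doa, motion_std=5.0, obs_std=10.0, rng=None):
    rng = rng or np.random.default_rng()
    # Predict: random-walk dynamics on the DOA angle.
    particles = wrap_deg(particles + rng.normal(0.0, motion_std, particles.shape))
    # Update: weight particles by their agreement with the observed DOA.
    err = wrap_deg(particles - observed_doa)
    weights = weights * np.exp(-0.5 * (err / obs_std) ** 2)
    weights /= weights.sum()
    # Resample (systematic) when the effective sample size collapses.
    n = len(particles)
    if 1.0 / np.sum(weights ** 2) < 0.5 * n:
        positions = (rng.random() + np.arange(n)) / n
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
        particles = particles[idx]
        weights = np.full(n, 1.0 / n)
    # Circular weighted mean as the DOA estimate.
    rad = np.radians(particles)
    estimate = np.degrees(np.arctan2(weights @ np.sin(rad), weights @ np.cos(rad)))
    return particles, weights, estimate
```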

#20 Harmonic-Percussive Source Separation of Polyphonic Music by Suppressing Impulsive Noise Events

Authors: Gurunath Reddy M ; K. Sreenivasa Rao ; Partha Pratim Das

In recent years, harmonic-percussive source separation methods have been gaining importance because of their potential applications in many music information retrieval tasks. The goal of these decomposition methods is to achieve near real-time separation and distortion- and artifact-free component spectrograms, along with their equivalent time-domain signals, for potential music applications. In this paper, we propose a decomposition method based on filtering/suppressing the impulsive interference of the percussive source on the harmonic components, and the impulsive interference of the harmonic source on the percussive components, with a modified moving-average filter in the Fourier frequency domain. A significant advantage of the proposed method is that it minimizes the artifacts in the separated signal spectrograms. We also propose affine and gain masking methods to separate the harmonic and percussive components with minimal spectral leakage. Objective measures and the separated spectrograms show that the proposed method outperforms existing rank-order-filtering-based harmonic-percussive separation methods.
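
A minimal illustration of the underlying idea, smoothing the magnitude spectrogram along time to suppress impulsive (percussive) interference and along frequency to suppress harmonic interference before forming masks from the two envelopes, is sketched below with plain moving-average filters. The filter lengths and the soft Wiener-style masks are illustrative choices, not the paper's modified moving-average filter or its affine and gain masks.

```python
# Minimal sketch of harmonic-percussive mask estimation with moving-average filters.
import numpy as np
from scipy.ndimage import uniform_filter1d

def hpss_moving_average(mag, time_len=17, freq_len=17, p=2.0, eps=1e-8):
    """mag: magnitude spectrogram of shape (n_bins, n_frames)."""
    harmonic_env = uniform_filter1d(mag, size=time_len, axis=1)    # smooth across time
    percussive_env = uniform_filter1d(mag, size=freq_len, axis=0)  # smooth across frequency
    mask_h = harmonic_env ** p / (harmonic_env ** p + percussive_env ** p + eps)
    mask_p = 1.0 - mask_h
    return mask_h * mag, mask_p * mag                              # component spectrograms
```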

#21 Speaker Activity Detection and Minimum Variance Beamforming for Source Separation

Authors: Enea Ceolini ; Jithendar Anumula ; Adrian Huber ; Ilya Kiselev ; Shih-Chii Liu

This work proposes a framework that renders minimum variance beamforming blind, allowing for source separation in real-world environments with an ad-hoc multi-microphone setup, using no assumption other than knowledge of the number of speakers. The framework allows for multiple active speakers at the same time and estimates the activity of every single speaker at a flexible time resolution. These estimated speaker activities are subsequently used for the calibration of the beamforming algorithm. The framework is tested with three different speaker activity detection (SAD) methods, two of which use classical algorithms and one of which is event-driven. Our methods, when tested in real-world reverberant scenarios, achieve a signal-to-interference ratio (SIR) of around 20 dB and a short-time objective intelligibility (STOI) score of 0.85, close to the optimal beamforming results of 22 dB SIR and 0.89 STOI.
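
Once per-speaker activity is known, the MVDR weights for each frequency bin follow the standard closed form w = R_n^{-1} d / (d^H R_n^{-1} d). The sketch below shows that computation with a masked covariance estimate; using the estimated speaker activities as the mask and the choice of steering vector are assumptions for illustration, not the paper's calibration procedure.

```python
# Minimal sketch of per-bin MVDR beamforming with an activity-masked noise covariance.
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (n_mics, n_mics) noise covariance; steering: (n_mics,) steering vector."""
    inv_rn_d = np.linalg.solve(noise_cov, steering)
    return inv_rn_d / (steering.conj() @ inv_rn_d)

def apply_mvdr(stfts, noise_mask, steering):
    """stfts: (n_mics, n_bins, n_frames); noise_mask: (n_bins, n_frames) in [0, 1];
    steering: (n_mics, n_bins), e.g. a relative transfer function toward the target."""
    n_mics, n_bins, n_frames = stfts.shape
    out = np.zeros((n_bins, n_frames), dtype=complex)
    for f in range(n_bins):
        X = stfts[:, f, :]                                               # (n_mics, n_frames)
        Rn = (noise_mask[f] * X) @ X.conj().T / max(noise_mask[f].sum(), 1e-8)
        Rn += 1e-6 * np.trace(Rn).real / n_mics * np.eye(n_mics)         # diagonal loading
        w = mvdr_weights(Rn, steering[:, f])
        out[f] = w.conj() @ X                                            # beamformed bin
    return out
```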

#22 Sparsity-Constrained Weight Mapping for Head-Related Transfer Functions Individualization from Anthropometric Features

Authors: Xiaoke Qi ; Jianhua Tao

Head-related transfer functions (HRTFs) describe the propagation of sound waves from the sound source to the ear drums and contain most of the information used for localization. However, HRTFs are highly individual-dependent owing to the differences in anthropometric features between subjects, so HRTF individualization is a great challenge for accurate localization perception in virtual auditory displays (VAD). In this paper, we propose a sparsity-constrained weight mapping method, termed SWM, to obtain individual HRTFs. The key idea behind SWM is to obtain optimal weights for combining HRTFs from the training subjects based on the relationship between the anthropometric features of the target subject and those of the training subjects. To this end, SWM learns two sparse representations between the target subject and the training subjects, in terms of anthropometric features and HRTFs respectively. A non-negative sparse model is used for this purpose, considering the non-negative nature of the anthropometric features. Then, we build a mapping between the two weight vectors using nonlinear regression. Furthermore, an iterative data extension method is proposed to increase the number of training samples for the mapping model. Objective and subjective experimental results show that the proposed method outperforms other methods in terms of log-spectral distortion (LSD) and localization accuracy.
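
As a simplified stand-in for the sparse-representation step, the sketch below expresses the target subject's anthropometric features as a non-negative combination of the training subjects' features and reuses those weights to combine the training subjects' HRTFs. The NNLS solver and the direct weight reuse replace the learned sparse models and the nonlinear weight mapping described in the paper.

```python
# Minimal sketch of non-negative weight estimation from anthropometric features,
# followed by a weighted combination of the training subjects' HRTFs.
import numpy as np
from scipy.optimize import nnls

def individualize_hrtf(target_feats, train_feats, train_hrtfs):
    """train_feats: (n_subjects, n_feats); train_hrtfs: (n_subjects, n_hrtf_dims)."""
    weights, _ = nnls(train_feats.T, target_feats)   # non-negative weight per training subject
    weights /= max(weights.sum(), 1e-8)              # normalize the combination
    return weights @ train_hrtfs                     # individualized HRTF estimate
```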

#23 Speech Source Separation Using ICA in Constant Q Transform Domain

Authors: D.V.L.N Dheeraj Sai ; K. S. Kishor ; K Sri Rama Murty

In order to separate individual sources from convolutive speech mixtures, complex-domain independent component analysis (ICA) is employed on the individual frequency bins of time-frequency representations of the speech mixtures, obtained using the short-time Fourier transform (STFT). The frequency components computed using the STFT are separated by a constant frequency difference with a constant frequency resolution. However, it is well known that the human auditory mechanism offers better resolution at lower frequencies. Hence, the perceptual quality of the extracted sources critically depends on the separation achieved in the lower frequency components. In this paper, we propose to perform source separation on the time-frequency representation computed through the constant Q transform (CQT), which offers non-uniform logarithmic binning in the frequency domain. Complex-domain ICA is performed on the individual bins of the CQT to obtain separated components in each frequency bin, which are suitably scaled and permuted to obtain separated sources in the CQT domain. The estimated sources are obtained by applying the inverse constant Q transform to the scaled and permuted sources. In comparison with STFT-based frequency-domain ICA methods, the proposed method yields a consistent improvement of 3 dB or more in the signal-to-interference ratios of the extracted sources.

#24 Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming

Authors: Lu Yin ; Ziteng Wang ; Risheng Xia ; Junfeng Li ; Yonghong Yan

The recently proposed Permutation Invariant Training (PIT) technique addresses the label permutation problem in multi-talker speech separation. It has been shown to be effective in the single-channel separation case. In this paper, we propose to extend the PIT-based technique to the multi-channel multi-talker speech separation scenario. PIT is used to train a neural network that outputs masks for each speaker, followed by a Minimum Variance Distortionless Response (MVDR) beamformer. The beamformer utilizes the spatial information of the different speakers and alleviates the performance degradation due to misaligned labels. Experimental results show that the proposed PIT-MVDR-based technique leads to higher Signal-to-Distortion Ratios (SDRs) than the single-channel speech separation method when tested on two-speaker and three-speaker mixtures.

#25 Expectation-Maximization Algorithms for Itakura-Saito Nonnegative Matrix Factorization

Authors: Paul Magron ; Tuomas Virtanen

This paper presents novel expectation-maximization (EM) algorithms for estimating the nonnegative matrix factorization model with the Itakura-Saito divergence. Indeed, the common EM-based approach exploits the space-alternating generalized EM (SAGE) variant of EM, but it usually performs worse than the conventional multiplicative algorithm. We propose to explore these algorithms more exhaustively, in particular the choice of methodology (standard EM or the SAGE variant) and of the latent variable set (full or reduced). We then derive four EM-based algorithms, three of which are novel. Speech separation experiments show that one of the novel algorithms, using a standard EM methodology and a reduced set of latent variables, outperforms its SAGE variants and competes with the conventional multiplicative algorithm.
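
For reference, the conventional multiplicative-update baseline for Itakura-Saito NMF that the EM-based algorithms are compared against can be sketched as follows; V is a power spectrogram, and the initialization, iteration count and flooring are illustrative choices.

```python
# Minimal sketch of multiplicative updates for NMF with the Itakura-Saito divergence.
import numpy as np

def is_nmf(V, n_components, n_iter=200, eps=1e-10, rng=None):
    """V: non-negative power spectrogram of shape (n_bins, n_frames)."""
    rng = rng or np.random.default_rng(0)
    F, N = V.shape
    W = np.abs(rng.standard_normal((F, n_components))) + eps
    H = np.abs(rng.standard_normal((n_components, N))) + eps
    for _ in range(n_iter):
        Vhat = W @ H + eps
        W *= ((V / Vhat ** 2) @ H.T) / ((1.0 / Vhat) @ H.T)   # update basis spectra
        Vhat = W @ H + eps
        H *= (W.T @ (V / Vhat ** 2)) / (W.T @ (1.0 / Vhat))   # update activations
        W = np.maximum(W, eps)                                # floor to keep entries positive
        H = np.maximum(H, eps)
    return W, H
```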